The Gotcha Moments in Managing Goroutines

Smell

package main

import (
	"fmt"
	"time"
)

func main() {
	done := make(chan bool)
	go func(done chan bool) {
		childDone := make(chan bool)
		go func(childDone chan bool) {
			// Simulate work.
			time.Sleep(2 * time.Second)
			fmt.Println("Child goroutine work done!")
			childDone <- true
		}(childDone)
		// Wait for the child goroutine to finish.
		<-childDone
		fmt.Println("Parent goroutine received child done signal.")
		// Pass the child's "done" signal back up to the main goroutine.
		done <- true
	}(done)

	// Wait for the parent goroutine to signal that the child is done.
	<-done
	fmt.Println("Main goroutine received done signal.")
}

A while ago, I was reviewing some code from an intern, and it loosely resembled this. I remember writing code in a similar way myself, haha. So I guess it's a good time to reflect on a previous version of myself.

Let's discuss why this code might be considered smelly:

Nested Goroutines Without a Clear Reason.
It is quite common to have nested goroutines, call nested synchronized functions, or take mutex control in a nested manner, whether you're trying to secure sequential access to some resources or modify a tree node's children. Therefore, it's sometimes inevitable to engage in such patterns.
However, in this case, spawning a goroutine within another goroutine without a clear reason can lead to confusion and difficulty in understanding the program's flow.
Overcomplicated Channel Usage.
Go channels are handy for communicating data and changes, and they also serve as a primitive synchronization mechanism. However, a channel is not always what the semantics call for. When the goal is only to wait for completion or to broadcast a stop signal, a sync.WaitGroup or closing a channel expresses the intent more directly. Overusing channels, regardless of context and of what the operation actually means, adds unnecessary mental burden.
Hardcoded Sleep.
Hardcoding values such as sleep durations is an anti-pattern. It can also lead to race conditions if the developer relies on the sleep to arrange the goroutines in a particular order at runtime.
Lack of Error Handling.
There is no error handling in the goroutines. In real-world applications, one would typically want to handle potential errors and relay them back to the main function.
Channel as a "Handle".
Using channels as a "handle" for goroutines is not idiomatic in Go. Typically, we don't think of goroutines as having handles; instead, we use channels to synchronize with or communicate between them.
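For comparison, the same "start some work and wait for it" flow can be written with a single goroutine and a sync.WaitGroup. This is only a minimal sketch: the doWork helper and its return value are hypothetical, and the short sleep merely stands in for real work.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// doWork is a hypothetical stand-in for the child's task; the short sleep
// only simulates work.
func doWork() string {
	time.Sleep(10 * time.Millisecond)
	return "work done"
}

// runOnce starts one goroutine, waits for it with a WaitGroup, and returns
// whatever the goroutine produced.
func runOnce() string {
	var wg sync.WaitGroup
	var result string

	wg.Add(1)
	go func() {
		defer wg.Done()
		// Written before Done and read after Wait, so there is no data race.
		result = doWork()
	}()

	wg.Wait()
	return result
}

func main() {
	fmt.Println(runOnce())
}
```

There is no nesting and no channel here: the WaitGroup says "wait until this finishes", which is all the original snippet was trying to express.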

Better code, but ...

Having seen the previous example, which works but not really, let's look at Go code that is closer to production level. It is free of the code smells we have discussed, but there is no silver bullet.

package main

import (
    "context"
    "sync"
    "time"
)

type ErrorMut struct {
    lock sync.Mutex
    err  error
}

func (errorMutex *ErrorMut) setError(err error) {
    errorMutex.lock.Lock()
    defer errorMutex.lock.Unlock()
    errorMutex.err = err
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    var wg sync.WaitGroup

    stopJobChan := make(chan struct{})
    errorMut := &ErrorMut{}
    wg.Add(1)
    go worker(ctx, stopJobChan, &wg, errorMut)

    // Assuming 'businessLogic' is some condition that should be defined elsewhere
    // if businessLogic {
    //     close(stopJobChan)
    // }

    wg.Wait()

    // Handle error if any
    if errorMut.err != nil {
        // handle error
    }
}

func worker(ctx context.Context, stopJobChan <-chan struct{}, wg *sync.WaitGroup, errorMutex *ErrorMut) {
    defer wg.Done()

    for {
        select {
        case <-ctx.Done():
            return
        case <-stopJobChan:
            return
        default:
            // some business logic

            // error path
            // error := ... // Get an error from somewhere
            // errorMutex.setError(error)
        }
    }
}

Let's evaluate this updated code snippet to see if it addresses our previous concerns effectively:

Nested Goroutines Without a Clear Reason: Pending
The updated code does not include any nested goroutines, which means it sidesteps this specific issue rather than solving it.
Overcomplicated Channel Usage: Check
Closing stopJobChan is an idiomatic way to signal the worker to stop.
Hardcoded Sleep: Check
There is no hardcoded sleep in the updated code. context.WithTimeout is a more appropriate way to bound an operation than a hardcoded sleep: the context is cancelled when the timeout expires or when cancel is called, whichever comes first. This concern has been addressed.
Lack of Error Handling: Check
The updated code includes a structure for error handling through the ErrorMut type and checks for an error at the end of the main function.

Summary

We introduced four elements to "solve" the problem.

A context for managing timeouts.
An individual stop signal for each goroutine.
A WaitGroup, acting as a countdown that oversees the entire operation.
An error object for each goroutine.

However, if we scale up to, say, a million goroutines, we would face the challenge of managing:

A large number of stop signals, one for each goroutine.
A significant number of error objects, potentially one for each goroutine.
The complexity of spawning and managing nested goroutines would still be an issue.
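One possible way to keep this per-goroutine state from multiplying is to let every worker share a single cancellable context and to record only the first error. The sketch below is an illustration under assumptions, not the approach from the code above: runAll, task, and the failID parameter are hypothetical names, and the "worker fails when id equals failID" rule exists only to trigger the error path.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// task is a hypothetical unit of work: the worker whose id equals failID
// fails immediately; every other worker finishes after a short delay unless
// the shared context is cancelled first.
func task(ctx context.Context, id, failID int) error {
	if id == failID {
		return fmt.Errorf("worker %d failed", id)
	}
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(50 * time.Millisecond):
		return nil
	}
}

// runAll starts n workers that share one context. The first error is kept
// and cancels the context, which acts as the stop signal for everyone else.
func runAll(n, failID int) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if err := task(ctx, id, failID); err != nil {
				once.Do(func() {
					firstErr = err
					cancel()
				})
			}
		}(i)
	}
	wg.Wait()
	return firstErr
}

func main() {
	fmt.Println(runAll(1000, 3))
	fmt.Println(runAll(5, -1))
}
```

With this shape there is one stop signal and one error slot no matter how many goroutines run. The golang.org/x/sync/errgroup package offers a similar pattern if an external dependency is acceptable. The trade-off is that every error after the first is dropped.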

Go for production

In real-life scenarios, we usually face two main types of problems when dealing with concurrent programming:

We are concerned with the result of each individual outstanding task. For example, checking inventory stocks or applying guard clauses for condition checking.
We focus on the overall progress of the task. For example, traversing a file system or calculating Conway's Game of Life.

Consider the following scenario:

A goroutine spawns two child goroutines, and each child in turn spawns two more, down to a configurable depth (three levels below the root in the code that follows). This setup implements cascading cancellation and aggregates the results from the child goroutines to the parent routine.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// WorkResult holds the result of the work a goroutine did
type WorkResult struct {
	ID     int
	Result string
}

// recursiveSpawn starts two new goroutines if the current depth is less than maxDepth.
// Each goroutine does some work and, if not at the maxDepth, spawns two more goroutines.
func recursiveSpawn(id, currentDepth, maxDepth int, wg *sync.WaitGroup, ctx context.Context, resultChan chan<- WorkResult) {
	defer wg.Done()

	// Do work at the current level
	select {
	case <-time.After(1 * time.Second): // Simulate work by sleeping
		resultChan <- WorkResult{ID: id, Result: "completed"}
	case <-ctx.Done():
		resultChan <- WorkResult{ID: id, Result: "cancelled"}
		return // If cancelled, don't spawn more goroutines
	}

	// If this is not the last level, spawn two more goroutines
	if currentDepth < maxDepth {
		for i := 0; i < 2; i++ {
			childID := id*10 + i
			wg.Add(1)
			go recursiveSpawn(childID, currentDepth+1, maxDepth, wg, ctx, resultChan)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second) // Cancel after 3 seconds
	defer cancel()

	var wg sync.WaitGroup
	resultChan := make(chan WorkResult) // Unbuffered channel for results

	wg.Add(1)
	go recursiveSpawn(1, 0, 3, &wg, ctx, resultChan) // Spawn goroutines to a depth of 3

	// In a separate goroutine, collect all results and print them
	collected := make(chan struct{})
	go func() {
		defer close(collected)
		for result := range resultChan {
			fmt.Printf("Routine %d: %s\n", result.ID, result.Result)
		}
	}()

	wg.Wait()         // Wait for all goroutines to finish
	close(resultChan) // Close the result channel to stop the result collecting goroutine
	<-collected       // Wait for the collector so the last result is printed before main exits
	fmt.Println("All goroutines have been handled.")
}

This example has a control granularity problem within the spawning hierarchy of goroutines. Consider Spawn (id:100, depth:2): it cannot be cancelled in a targeted way. Because every goroutine shares the same context, you can't cancel this specific goroutine and its child routines without cancelling the entire tree.

Likewise, extracting and processing only the results of Spawn (id:1000, depth:3) and Spawn (id:1001, depth:3) is awkward, because every result arrives on the same shared channel. To gain the necessary control, you'd need a distinct context for each branch, a task that adds complexity and demands careful consideration of resource management.
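To make the per-branch idea concrete, here is a rough sketch of deriving one cancellable context per node, reusing the id*10+i numbering from recursiveSpawn. The branch type and the buildTree, cancelledIDs, and cancelSubtree helpers are hypothetical names introduced for illustration. No goroutines are started here; in the real program each goroutine would select on its own node's context.

```go
package main

import (
	"context"
	"fmt"
	"sort"
)

// branch pairs a node's context with the function that cancels it.
type branch struct {
	ctx    context.Context
	cancel context.CancelFunc
}

// buildTree derives one cancellable context per node. Each child context is
// derived from its parent's, so cancelling a node also cancels its subtree.
func buildTree(parent context.Context, id, depth, maxDepth int, nodes map[int]branch) {
	ctx, cancel := context.WithCancel(parent)
	nodes[id] = branch{ctx: ctx, cancel: cancel}
	if depth < maxDepth {
		for i := 0; i < 2; i++ {
			buildTree(ctx, id*10+i, depth+1, maxDepth, nodes)
		}
	}
}

// cancelledIDs returns the sorted IDs whose contexts have been cancelled.
func cancelledIDs(nodes map[int]branch) []int {
	var ids []int
	for id, b := range nodes {
		if b.ctx.Err() != nil {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids)
	return ids
}

// cancelSubtree builds a tree of depth 3 rooted at ID 1, cancels only the
// branch rooted at target, and reports which nodes were affected.
func cancelSubtree(target int) []int {
	root, cancelRoot := context.WithCancel(context.Background())
	defer cancelRoot() // releases every remaining context on return

	nodes := make(map[int]branch)
	buildTree(root, 1, 0, 3, nodes)

	if b, ok := nodes[target]; ok {
		b.cancel()
	}
	return cancelledIDs(nodes)
}

func main() {
	fmt.Println(cancelSubtree(100)) // prints [100 1000 1001]
}
```

The cost is visible even in this small sketch: someone has to own the map of cancel functions, guard it if goroutines register themselves concurrently, and make sure every cancel is eventually called.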

Conclusion

When exploring the design of concurrent systems in Go, we might contemplate the implications of spawning goroutines dynamically. It's a balancing act: leveraging Go's concurrency strengths (lightweight goroutines, ease of use) against the potential for overextension of system resources (limited memory, CPU limits, running out of file descriptors). Is it prudent to allow a system to operate with such a level of unpredictability?

Error-proneness (debugging complexity, synchronization pitfalls) is also part of the equation. Implementing concurrency controls such as semaphores or worker pools could offer a structured approach to harness this power. It's an open question: How do we best navigate these trade-offs to achieve both scalability and reliability?
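As one illustration of such a control, a buffered channel can act as a counting semaphore that caps how many goroutines run at once. This is a minimal sketch under assumptions: runBounded and its parameters are hypothetical, the sleep stands in for real work, and the peak counter exists only to show that the limit holds.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runBounded runs `tasks` simulated jobs with at most `limit` in flight and
// returns the highest number of jobs observed running at the same time.
func runBounded(tasks, limit int) int {
	sem := make(chan struct{}, limit) // buffered channel as counting semaphore

	var (
		wg            sync.WaitGroup
		mu            sync.Mutex
		running, peak int
	)
	for i := 0; i < tasks; i++ {
		sem <- struct{}{} // blocks while `limit` jobs are already running
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the job ends

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			time.Sleep(time.Millisecond) // simulated work

			mu.Lock()
			running--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak concurrency:", runBounded(50, 4))
}
```

The number of live goroutines is now bounded by limit rather than by the size of the input, which trades some throughput for predictable resource usage.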